The Annals of Statistics 

2009, Vol. 37, No. 5A, 2083-2108 

DOI: 10.1214/08-AOS641 

© Institute of Mathematical Statistics, 2009 

FUNCTIONAL LINEAR REGRESSION THAT'S INTERPRETABLE 

By Gareth M. James, Jing Wang and Ji Zhu 

University of Southern California, University of Michigan 
and University of Michigan 

Regression models to relate a scalar $Y$ to a functional predictor
$X(t)$ are becoming increasingly common. Work in this area has concentrated
on estimating a coefficient function, $\beta(t)$, with $Y$ related
to $X(t)$ through $\int \beta(t)X(t)\,dt$. Regions where $\beta(t) \neq 0$ correspond
to places where there is a relationship between $X(t)$ and $Y$. Alternatively,
points where $\beta(t) = 0$ indicate no relationship. Hence, for
interpretation purposes, it is desirable for a regression procedure to
be capable of producing estimates of $\beta(t)$ that are exactly zero over
regions with no apparent relationship and have simple structures over
the remaining regions. Unfortunately, most fitting procedures result
in an estimate for $\beta(t)$ that is rarely exactly zero and has unnatural
wiggles, making the curve hard to interpret. In this article we introduce
a new approach which uses variable selection ideas, applied to
various derivatives of $\beta(t)$, to produce estimates that are
interpretable, flexible and accurate. We call our method "Functional
Linear Regression That's Interpretable" (FLiRTI) and demonstrate
it on simulated and real-world data sets. In addition, non-asymptotic
theoretical bounds on the estimation error are presented. The bounds
provide strong theoretical motivation for our approach.



1. Introduction. In recent years functional data analysis (FDA) has become
an increasingly important analytical tool as more data has arisen where
the primary unit of observation can be viewed as a curve or, in general, a
function. One of the most useful tools in FDA is that of functional regression.
This setting can correspond to either functional predictors or functional
responses. See Ramsay and Silverman (2002) and Müller and Stadtmüller
(2005) for numerous specific applications. One commonly studied problem
involves data containing functional responses. A sampling of papers
examining this situation includes Fahrmeir and Tutz (1994), Liang and Zeger
(1986), Faraway (1997), Hoover et al. (1998), Wu et al. (1998), Fan and Zhang
(2000) and Lin and Ying (2001). However, in this paper, we are primarily
interested in the alternative situation, where we obtain a set of observations
$\{X_i(t), Y_i\}$ for $i = 1, \ldots, n$, where $X_i(t)$ is a functional predictor and
$Y_i$ a real valued response. Ramsay and Silverman (2005) discuss this scenario
and several papers have also been written on the topic, both for
continuous and categorical responses, and for linear and nonlinear models
[Hastie and Mallows (1993), James and Hastie (2001), Ferraty and Vieu
(2002), James (2002), Ferraty and Vieu (2003), Müller and Stadtmüller (2005),
James and Silverman (2005)].

Since our primary interest here is interpretation, we will be examining the
standard functional linear regression (FLR) model, which relates functional
predictors to a scalar response via

(1)  $Y_i = \beta_0 + \int X_i(t)\beta(t)\,dt + \varepsilon_i, \qquad i = 1, \ldots, n,$

where $\beta(t)$ is the "coefficient function." We will assume that $X_i(t)$ is scaled
such that $0 \le t \le 1$. Clearly, for any finite $n$, it would be possible to perfectly
interpolate the responses if no restrictions were placed on $\beta(t)$. Such restrictions
generally take one of two possible forms. The first method, which we
call the "basis approach," involves representing $\beta(t)$ using a $p$-dimensional
basis function, $\beta(t) = B(t)^T\eta$, where $p$ is hopefully large enough to capture
the patterns in $\beta(t)$ but small enough to regularize the fit. With this method
(1) can be reexpressed as $Y_i = \beta_0 + \mathbf{X}_i^T\eta + \varepsilon_i$, where $\mathbf{X}_i = \int X_i(t)B(t)\,dt$, and
$\eta$ can be estimated using ordinary least squares. The second method, which
we call the "penalization approach," involves a penalized least squares
estimation procedure to shrink variability in $\beta(t)$. A standard penalty is of
the form $\int \beta^{(d)}(t)^2\,dt$ with $d = 2$ being a common choice. In this case one
would find $\beta(t)$ to minimize

$\sum_{i=1}^n \Bigl(Y_i - \beta_0 - \int X_i(t)\beta(t)\,dt\Bigr)^2 + \lambda \int \beta^{(d)}(t)^2\,dt$

for some $\lambda > 0$.
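To make the basis approach concrete, the following is a minimal Python sketch rather than the authors' implementation. It assumes each $X_i(t)$ is observed on an equally spaced grid over $[0,1]$, approximates $\int X_i(t)B(t)\,dt$ by a Riemann sum, and estimates $\beta_0$ and $\eta$ by ordinary least squares. The function names `basis_design` and `fit_basis_ols` are hypothetical.

```python
import numpy as np

def basis_design(x_curves, basis, t):
    """Approximate the vector ∫ X_i(t) B(t) dt for every curve by a Riemann sum.

    x_curves: (n, T) array, X_i evaluated on the grid t
    basis:    (T, p) array, the p basis functions evaluated on t
    t:        (T,) equally spaced grid on [0, 1] (an assumption of this sketch)
    """
    dt = t[1] - t[0]
    return x_curves @ basis * dt

def fit_basis_ols(x_curves, y, basis, t):
    """Least squares fit of Y_i = beta_0 + X_i^T eta + eps_i."""
    design = np.column_stack([np.ones(len(y)), basis_design(x_curves, basis, t)])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    beta0, eta = coef[0], coef[1:]
    return beta0, basis @ eta  # intercept and beta_hat(t) evaluated on the grid
```

With a small basis such as $B(t) = [1, t, t^2]^T$ the fit is regularized only through the choice of $p$, which is the limitation the penalization approach and FLiRTI address in different ways.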

As with standard linear regression, $\beta(t)$ determines the effect of $X_i(t)$ on
$Y_i$. For example, changes in $X_i(t)$ have no effect on $Y_i$ over regions where
$\beta(t) = 0$. Alternatively, changes in $X_i(t)$ have a greater effect on $Y_i$ over
regions where $|\beta(t)|$ is large. Hence, in terms of interpretation, coefficient
curves with certain structures are more appealing than others. For example,
if $\beta(t)$ is exactly zero over large regions then $X_i(t)$ only has an effect on $Y_i$
over the remaining time points. Additionally, if $\beta(t)$ is constant for any given
non-zero region then the effect of $X_i(t)$ on $Y_i$ remains constant within that
region. Finally, if $\beta(t)$ is exactly linear for any given region then the change
in the effect of $X_i(t)$ is constant over that region. Clearly the interpretation
of the predictor-response relationship is more difficult as the shape of $\beta(t)$
becomes more complicated. Unfortunately, the basis and penalization approaches
both generate $\hat\beta(t)$ curves that exhibit wiggles and are not exactly
linear or constant over any region. In addition, $\hat\beta(t)$ will be exactly equal to
zero at no more than a few locations even if there is no relationship between
$X(t)$ and $Y$ for large regions of $t$.

Fig. 1. (a) True beta curve (grey) generated from two quadratic curves and a section
with $\beta(t) = 0$. The FLiRTI estimate from constraining the zeroth and third derivative is
shown in black. (b) Same plot for the region $0.3 \le t \le 0.7$. The dashed line is the best
B-spline fit.

In this paper we develop a new method, which we call "Functional Linear
Regression That's Interpretable" (FLiRTI), that produces accurate but also
highly interpretable estimates for the coefficient function $\beta(t)$. Additionally,
it is computationally efficient, extremely flexible in terms of the form of
the estimate and has highly desirable theoretical properties. The key to our
procedure is to reformulate the problem as a form of variable selection. In
particular we divide the time period up into a fine grid of points. We then use
variable selection methods to determine whether the $d$th derivative of $\beta(t)$
is zero or not at each of the grid points. The implicit assumption is that the
$d$th derivative will be zero at most grid points, that is, it will be sparse. By
choosing appropriate derivatives one can produce a large range of highly
interpretable $\beta(t)$ curves. Consider, for example, Figures 1-3, which illustrate
a range of FLiRTI fits applied to simulated data sets. In Figure 1 the true
$\beta(t)$ used to generate the data consisted of a quadratic curve and a section
with $\beta(t) = 0$. Figure 1(a) plots the FLiRTI estimate, produced by assuming
sparsity in the zeroth and third derivatives. The sparsity in the zeroth
derivative generates the zero section while the sparsity in the third derivative
ensures a smooth fit. Figure 1(b) illustrates the same plot concentrating
on the region between 0.3 and 0.7. Notice that the corresponding B-spline
estimate, represented by the dashed line, provides a poor approximation for
the region where $\beta(t) = 0$. It is important to note that we did not specify
which regions would have zero derivatives; the FLiRTI procedure is capable
of automatically selecting the appropriate shape. In Figure 2 $\beta(t)$ was chosen
as a piecewise linear curve with the middle section set to zero. Figure 2(a)
shows the corresponding FLiRTI estimate generated by assuming sparsity
in the zeroth and second derivatives and is almost a perfect fit. Alternatively,
one can produce a smoother, but slightly less easily interpretable fit,
by assuming sparsity in higher-order derivatives. Figure 2(b), which concentrates
on the region between $t = 0.2$ and $t = 0.8$, plots the FLiRTI fit assuming
sparsity in the zeroth and third derivative. Notice that the sparsity in the
third derivative induces a smoother estimate with little sacrifice in accuracy.
Finally, Figure 3 illustrates a FLiRTI fit applied to data generated using a
simple cubic $\beta(t)$ curve, a situation where one might not expect FLiRTI to
provide any advantage over a standard approach such as using a B-spline
basis. However, the figure, along with the simulation results in Section 5,
shows that even in this situation the FLiRTI method gives highly accurate
estimates. These three examples illustrate the flexibility of FLiRTI in that
it can produce estimates ranging from highly interpretable simple linear fits,
through smooth fits with zero regions, to more complicated nonlinear structures
with equal ease. The key idea here is that, given a strong signal in
the data, FLiRTI is flexible enough to estimate $\beta(t)$ accurately. However, in
situations where the signal is weaker, FLiRTI will automatically shrink the
estimated $\beta(t)$ towards a more interpretable structure.

Fig. 2. Plots of true beta curve (grey) and corresponding FLiRTI estimates (black). For
each plot we constrained (a) zeroth and second derivative, (b) zeroth and third derivative.

The paper is laid out as follows. In Section 2 we develop the FLiRTI model
and also detail two fitting procedures, one making use of the lasso [Tibshirani
(1996)] and the other utilizing the Dantzig selector [Candes and Tao (2007)].
The theoretical developments for our method are presented in Section 3
where we outline both nonasymptotic bounds on the error as well as asymptotic
properties of our estimate as $n$ grows. Then, in Section 4, we extend
the FLiRTI method in two directions. First we show how to control multiple
derivatives simultaneously, which allows us to, for example, produce a
$\beta(t)$ curve that is exactly zero in certain sections and exactly linear in other
sections. Second, we develop a version of FLiRTI that can be applied to
generalized linear models (GLM). A detailed simulation study is presented
in Section 5. Finally, we apply FLiRTI to real world data in Section 6 and
end with a discussion in Section 7.

Fig. 3. (a) True beta curve (grey) generated from a cubic curve. The FLiRTI estimate
from constraining the zeroth and fourth derivative is represented by the solid black line and
the B-spline estimate is the dashed line. (b) Estimation errors using the B-spline method
(dashed) and FLiRTI (solid).



2. FLiRTI methodology. In this section we first develop the FLiRTI 
model and then demonstrate how we use the lasso and Dantzig selector 
methods to fit it and hence estimate $\beta(t)$. 

2.1. The FLiRTI model. Our approach borrows ideas from the basis
and penalization methods but is rather different from either. We start in
a similar vein to the basis approach by selecting a $p$-dimensional basis
$B(t) = [b_1(t), b_2(t), \ldots, b_p(t)]^T$. However, instead of assuming $B(t)$ provides
a perfect fit for $\beta(t)$, we allow for some error using the model

(2)  $\beta(t) = B(t)^T\eta + e(t),$

where $e(t)$ represents the deviations of the true $\beta(t)$ from our model. In
addition, unlike the basis approach where $p$ is chosen small to provide some
form of regularization, we typically choose $p \gg n$ so $|e(t)|$ can generally be
assumed to be small. In Section 3 we show that the error in the estimate
for $\beta(t)$, that is, $|\hat\beta(t) - \beta(t)|$, can potentially be of order $\sqrt{\log(p)/n}$. Hence,
low error rates can be achieved even for values of $p$ much larger than $n$.
Our theoretical results apply to any high dimensional basis, such as splines,
Fourier or wavelets. For the empirical results presented in this paper we
opted to use a simple grid basis, where $b_k(t)$ equals 1 if
$t \in R_k = \{t : (k-1)/p < t \le k/p\}$ and 0 otherwise.

Combining (1) and (2) we arrive at

(3)  $Y_i = \beta_0 + \mathbf{X}_i^T\eta + \varepsilon_i^*,$

where $\mathbf{X}_i = \int X_i(t)B(t)\,dt$ and $\varepsilon_i^* = \varepsilon_i + \int X_i(t)e(t)\,dt$. Estimating $\eta$ presents
a difficulty because $p > n$. One could potentially estimate $\eta$ using a variable
selection procedure except that for an arbitrary basis, $B(t)$, there is no reason
to suppose that $\eta$ will be sparse. In fact for many bases $\eta$ will contain no
zero elements. Instead we model $\beta(t)$ assuming that one or more of its derivatives
are sparse, that is, $\beta^{(d)}(t) = 0$ over large regions of $t$ for one or more
values of $d = 0, 1, 2, \ldots$. This model has the advantage of both constraining
$\eta$ enough to allow us to fit (3) as well as producing a highly interpretable
estimate for $\beta(t)$. For example, $\beta^{(0)}(t) = 0$ guarantees $X(t)$ has no effect on
$Y$ at $t$, $\beta^{(1)}(t) = 0$ implies that $\beta(t)$ is constant at $t$, $\beta^{(2)}(t) = 0$ means that
$\beta(t)$ is linear at $t$, etc.

Let $A = [D^dB(t_1), D^dB(t_2), \ldots, D^dB(t_p)]^T$, where $t_1, t_2, \ldots, t_p$ represent
a grid of $p$ evenly spaced points and $D^d$ is the $d$th finite difference operator,
that is, $DB(t_j) = p[B(t_j) - B(t_{j-1})]$, $D^2B(t_j) = p^2[B(t_j) - 2B(t_{j-1}) + B(t_{j-2})]$,
etc. Then, if

(4)  $\gamma = A\eta,$

$\gamma_j$ provides an approximation to $\beta^{(d)}(t_j)$ and hence, enforcing sparsity in $\gamma$
constrains $\beta^{(d)}(t_j)$ to be zero at most time points. For example, one may
believe that $\beta^{(2)}(t) = 0$ over many regions of $t$, that is, $\beta(t)$ is exactly linear
over large regions of $t$. In this situation we would let

(5)  $A = [D^2B(t_1), D^2B(t_2), \ldots, D^2B(t_p)]^T,$

which implies $\gamma_j = p^2[B(t_j)^T\eta - 2B(t_{j-1})^T\eta + B(t_{j-2})^T\eta]$. Hence, provided
$p$ is large, so $t$ is sampled on a fine grid, and $e(t)$ is smooth, $\gamma_j \approx \beta^{(2)}(t_j)$.
In this case enforcing sparsity in the $\gamma_j$'s will produce an estimate for $\beta(t)$
that is linear except at the time points corresponding to nonzero values of
$\gamma_j$.

If $A$ is constructed using a single derivative, as in (5), then we can always
choose a grid of $p$ different time points, $t_1, t_2, \ldots, t_p$, such that $A$ is a square
$p$ by $p$ invertible matrix. In this case $\eta = A^{-1}\gamma$ so we may combine (3) and
(4) to produce the FLiRTI model

(6)  $Y = V\gamma + \varepsilon^*,$

where $V = [\mathbf{1} \mid XA^{-1}]$, $\mathbf{1}$ is a vector of ones and $\beta_0$ has been incorporated
into $\gamma$. We discuss the situation with multiple derivatives, where $A$ may no
longer be invertible, in Section 4.
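As an illustration of how $A$ and $V$ might be assembled for the grid basis, here is a short sketch. It is not taken from the paper: it assumes $B(t_j)$ is the $j$th unit vector (one grid point per region $R_j$), that terms $B(t_{j-k})$ falling before the start of the grid are simply dropped, and that $\int X_i(t)b_k(t)\,dt \approx X_i(t_k)/p$. The names `difference_matrix` and `flirti_design` are hypothetical.

```python
import numpy as np

def difference_matrix(p, d):
    """A whose jth row is D^d B(t_j)^T for the grid basis, where B(t_j) = e_j.

    Assumption of this sketch: terms B(t_{j-k}) with j - k < 1 are dropped, so A
    is lower triangular with p**d on the diagonal and therefore invertible.
    """
    first_diff = p * (np.eye(p) - np.eye(p, k=-1))  # rows p * (e_j - e_{j-1})
    return np.linalg.matrix_power(first_diff, d)

def flirti_design(x_curves, d):
    """Return V = [1 | X A^{-1}] and A for curves sampled at the p grid points."""
    n, p = x_curves.shape
    X = x_curves / p  # ∫ X_i(t) b_k(t) dt ≈ X_i(t_k) / p for the grid basis
    A = difference_matrix(p, d)
    XA_inv = np.linalg.solve(A.T, X.T).T  # solves Z A = X without forming A^{-1}
    return np.column_stack([np.ones(n), XA_inv]), A
```

For $d = 2$, any $\eta$ that is linear in the grid index gives $\gamma = A\eta$ with zeros everywhere except the first two entries, which is the sparsity the model exploits.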

2.2. Fitting the model. Since $\gamma$ is assumed sparse, one could potentially 
use a variety of variable selection methods to fit (6). There has recently 
been a great deal of development of new model selection methods that work 
with large values of p. A few examples include the lasso [Tibshirani (1996), 
Chen, Donoho and Saunders (1998)], SCAD [Fan and Li (2001)], the Elastic 
Net [Zou and Hastie (2005)], the Dantzig selector [Candes and Tao (2007)] 
and VISA [Radchenko and James (2008)]. We opted to explore both the 
lasso and Dantzig selector for several reasons. First, both methods have 
demonstrated strong empirical results on models with large values of p. Sec- 
ond, the LARS algorithm [Efron et al. (2004)] can be used to efficiently 
compute the coefficient path for the lasso. Similarly the DASSO algorithm 
[James, Radchenko and Lv (2009)], which is a generalization of LARS, will 
efficiently compute the Dantzig selector coefficient path. Finally, we demonstrate
in Section 3 that identical non-asymptotic bounds can be placed on
the errors in the estimates of $\beta(t)$ that result from either approach.

Consider the linear regression model $Y = X\beta + \varepsilon$. Then the lasso estimate,
$\hat\beta_L$, is defined by

(7)  $\hat\beta_L = \arg\min_\beta \tfrac{1}{2}\|Y - X\beta\|_2^2 + \lambda\|\beta\|_1,$

where $\|\cdot\|_1$ and $\|\cdot\|_2$ respectively denote the $L_1$ and $L_2$ norms and $\lambda \ge 0$
is a tuning parameter. Alternatively, the Dantzig selector estimate, $\hat\beta_{DS}$, is
given by

(8)  $\hat\beta_{DS} = \arg\min_\beta \|\beta\|_1$ subject to $|X_j^T(Y - X\beta)| \le \lambda, \quad j = 1, \ldots, p,$

where $X_j$ is the $j$th column of $X$ and $\lambda \ge 0$ is a tuning parameter.

Using either approach the LARS or DASSO algorithms can be used to
efficiently compute all solutions for various values of the tuning parameter,
$\lambda$. Hence, using a validation/cross-validation approach to select $\lambda$ can be
easily implemented. To generate the final FLiRTI estimate we first produce
$\hat\gamma$ by fitting the FLiRTI model, (6), using either the lasso, (7), or the Dantzig
selector, (8). Note that both methods generally assume a standardized design
matrix with columns of norm one. Hence, we first standardize $V$, apply the
lasso or Dantzig selector and then divide the resulting coefficient estimates
by the original column norms of $V$ to produce $\hat\gamma$. After the coefficients, $\hat\gamma$,
have been obtained we produce the FLiRTI estimate for $\beta(t)$ using

(9)  $\hat\beta(t) = B(t)^T\hat\eta = B(t)^TA^{-1}\hat\gamma_{(-1)},$

where $\hat\gamma_{(-1)}$ represents $\hat\gamma$ after removing the estimate for $\beta_0$.
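The standardize, fit and rescale sequence described above can be sketched as follows. This is only an illustration: it swaps in plain cyclic coordinate descent for LARS or DASSO, so it returns the lasso solution for a single $\lambda$ rather than the whole coefficient path, it minimizes $\tfrac{1}{2}\|Y - \tilde V g\|_2^2 + \lambda\|g\|_1$ over the unit-norm design $\tilde V$, and it penalizes every coordinate (including the intercept column) alike, which is an assumption of the sketch. The function names are hypothetical.

```python
import numpy as np

def soft_threshold(z, lam):
    """Scalar soft-thresholding operator S(z, lam)."""
    return np.sign(z) * max(abs(z) - lam, 0.0)

def lasso_standardized(V, y, lam, n_sweeps=200):
    """Lasso on the column-standardized design, mapped back to the scale of V."""
    norms = np.linalg.norm(V, axis=0)
    Vs = V / norms                      # standardized design, columns of norm one
    g = np.zeros(V.shape[1])
    resid = np.asarray(y, dtype=float).copy()
    for _ in range(n_sweeps):
        for j in range(V.shape[1]):
            resid += Vs[:, j] * g[j]    # partial residual without coordinate j
            g[j] = soft_threshold(Vs[:, j] @ resid, lam)
            resid -= Vs[:, j] * g[j]
    return g / norms                    # divide by the original column norms
```

For the grid basis, where $B(t_k)^T\eta = \eta_k$, the curve estimate at the grid points would then be obtained by solving $A\hat\eta = \hat\gamma_{(-1)}$, for example with `np.linalg.solve(A, gamma_hat[1:])`, where `A` and `gamma_hat` are hypothetical names for the difference matrix and the fitted coefficients.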
3. Theoretical results. In this section we show that not only does the
FLiRTI approach empirically produce good estimates for $\beta(t)$, but that for
any $p$ by $p$ invertible $A$ we can in fact prove tight, nonasymptotic bounds
on the error in our estimate. In addition, we derive asymptotic rates of
convergence. Note that for notational convenience the results in this section
assume $\beta_0 = 0$ and drop the intercept term from the model. However, the
theory all extends in a straightforward manner to the situation with $\beta_0$
unknown.

3.1. A nonasymptotic bound on the error. Let $\hat\gamma_\lambda$ correspond to the lasso
solution using tuning parameter $\lambda$. Let $D_\lambda$ be a diagonal matrix with $j$th
diagonal equal to 1, $-1$ or 0 depending on whether the $j$th component of $\hat\gamma_\lambda$
is positive, negative or zero, respectively. Consider the following condition
on the design matrix, $V$:

(10)  $u = (D_\lambda \tilde V^T \tilde V D_\lambda)^{-1}\mathbf{1} \ge 0$ and $\|\tilde V^T \tilde V D_\lambda u\|_\infty \le 1,$

where $\tilde V$ corresponds to $V$ after standardizing its columns, $\mathbf{1}$ is a vector
of ones and the inequality for vectors is understood componentwise.
James, Radchenko and Lv (2009) prove that when (10) holds the Dantzig
selector's nonasymptotic bounds [Candes and Tao (2007)] also apply for the
lasso. Our Theorem 1 makes use of (10) to provide a nonasymptotic bound
on the $L_2$ error in the FLiRTI estimate, using either the Dantzig selector
or the lasso. Note that the values $\delta$, $\theta$ and $C_{n,p}(t)$ are all known constants
which we have defined in the proof of this result provided in Appendix A.

Theorem 1. For a given $p$-dimensional basis $B_p(t)$, let $\omega_p = \sup_t |e_p(t)|$
and $\gamma_p = A\eta_p$, where $A$ is a $p$ by $p$ matrix. Suppose that $\gamma_p$ has at most
$S_p$ non-zero components and $\delta^V_{2S_p} + \theta^V_{S_p,2S_p} < 1$. Further, suppose that we
estimate $\beta(t)$ using the FLiRTI estimate given by (9) using any value of $\lambda$
such that (10) holds and

(11)  $\max|\tilde V^T\varepsilon^*| \le \lambda.$

Then, for every $0 \le t \le 1$,

(12)  $|\hat\beta(t) - \beta(t)| \le \frac{1}{\sqrt{n}}\,C_{n,p}(t)\,\lambda\sqrt{S_p} + \omega_p.$

The constants $\delta$ and $\theta$ are both measures of the orthogonality of $V$. The
closer they are to zero the closer $V$ is to orthogonal. The condition
$\delta^V_{2S_p} + \theta^V_{S_p,2S_p} < 1$, which is utilized in the paper of Candes and Tao (2007), ensures
that $\beta(t)$ is identifiable. It should be noted that (10) is only required when
using the lasso to compute FLiRTI. The above results hold for the Dantzig
selector even if (10) is violated. While this is a slight theoretical advantage
for the Dantzig selector, our simulation results suggest that both methods
perform well in practice. Theorem 1 suggests that our optimal choice for $\lambda$
would be the lowest value such that (11) holds. Theorem 2 shows how to
choose $\lambda$ such that (11) holds with high probability.

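Theorem 2. Suppose that $\varepsilon_i \sim N(0, \sigma_1^2)$ and that there exists an $M < \infty$
such that $\int |X_i(t)|\,dt \le M$ for all $i$. Then for any $\phi \ge 0$, if
$\lambda = \sigma_1\sqrt{2(1+\phi)\log p} + M\omega_p\sqrt{n}$ then (11) will hold with probability at least
$1 - \bigl(p^{\phi}\sqrt{4\pi(1+\phi)\log p}\bigr)^{-1}$, and hence

(13)  $|\hat\beta(t) - \beta(t)| \le \frac{1}{\sqrt{n}}\,C_{n,p}(t)\,\sigma_1\sqrt{2S_p(1+\phi)\log p} + \omega_p\bigl\{1 + C_{n,p}(t)M\sqrt{S_p}\bigr\}.$

In addition, if we assume $\varepsilon_i^* \sim N(0, \sigma_2^2)$ then (11) will hold with the same
probability for $\lambda = \sigma_2\sqrt{2(1+\phi)\log p}$ in which case

(14)  $|\hat\beta(t) - \beta(t)| \le \frac{1}{\sqrt{n}}\,C_{n,p}(t)\,\sigma_2\sqrt{2S_p(1+\phi)\log p} + \omega_p.$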
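As a reading aid, the step from Theorem 1 to (13) can be written out explicitly. The sketch below assumes the bound of Theorem 1 has the form $|\hat\beta(t)-\beta(t)| \le n^{-1/2}C_{n,p}(t)\lambda\sqrt{S_p} + \omega_p$, which is the form that agrees with both (13) and (14); it should be checked against the proof in Appendix A. Substituting $\lambda = \sigma_1\sqrt{2(1+\phi)\log p} + M\omega_p\sqrt{n}$ gives:

```latex
\begin{align*}
|\hat\beta(t)-\beta(t)|
 &\le \frac{1}{\sqrt{n}}\,C_{n,p}(t)\sqrt{S_p}
      \Bigl(\sigma_1\sqrt{2(1+\phi)\log p} + M\omega_p\sqrt{n}\Bigr) + \omega_p \\
 &= \frac{1}{\sqrt{n}}\,C_{n,p}(t)\,\sigma_1\sqrt{2S_p(1+\phi)\log p}
    + \omega_p\bigl\{1 + C_{n,p}(t)\,M\sqrt{S_p}\bigr\}.
\end{align*}
```

The same substitution with $\lambda = \sigma_2\sqrt{2(1+\phi)\log p}$ has no $M\omega_p\sqrt{n}$ term, which is why the second term of (14) reduces to $\omega_p$ alone.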

Note that (13) and (14) are non-asymptotic results that hold, with high
probability, for any $n$ or $p$. One can show that, under suitable conditions,
$C_{n,p}(t)$ converges to a constant as $n$ and $p$ grow. In this case the first terms
of (13) and (14) are proportional to $\sqrt{\log p / n}$ while the second terms will
generally shrink as $\omega_p$ declines with $p$. For example, using the piecewise
constant basis it is easy to show that $\omega_p$ converges to zero at a rate of $1/p$
provided $\beta'(t)$ is bounded. Alternatively, using a piecewise polynomial basis
of order $d$, $\omega_p$ converges to zero at a rate of $1/p^{d+1}$ provided $\beta^{(d+1)}(t)$
is bounded.
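The tuning parameter and probability statement of Theorem 2 are easy to evaluate numerically. The helper names below are hypothetical, and the formulas are the ones read from Theorem 2 as stated in this section, namely $\lambda = \sigma_1\sqrt{2(1+\phi)\log p} + M\omega_p\sqrt{n}$ and the lower bound $1 - (p^{\phi}\sqrt{4\pi(1+\phi)\log p})^{-1}$; treat the sketch as a convenience for exploring how the quantities scale, not as part of the authors' procedure.

```python
import math

def theorem2_lambda(sigma1, p, n, phi, M, omega_p):
    """lambda = sigma1 * sqrt(2 (1 + phi) log p) + M * omega_p * sqrt(n)."""
    return sigma1 * math.sqrt(2.0 * (1.0 + phi) * math.log(p)) + M * omega_p * math.sqrt(n)

def theorem2_probability_bound(p, phi):
    """Lower bound on the probability that (11) holds: 1 - 1 / (p^phi sqrt(4 pi (1 + phi) log p))."""
    return 1.0 - 1.0 / (p ** phi * math.sqrt(4.0 * math.pi * (1.0 + phi) * math.log(p)))
```

Increasing $\phi$ pushes the probability bound towards one at the cost of a larger $\lambda$, and hence a looser error bound, which is the trade-off behind the phrase "with arbitrarily high probability" in the asymptotic results.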



3.2. Asymptotic rates of convergence. The bounds presented in Theorem 1
can be used to derive asymptotic rates of convergence for $\hat\beta(t)$ as $n$
and $p$ grow. The exact convergence rates are dependent on the choice of
$B_p(t)$ and $A$ so we first state A-1 through A-6, which give general conditions
for convergence. We show in Theorems 3-6 that these conditions are
sufficient to guarantee convergence for any choice of $B_p(t)$ and $A$, and then
Corollary 1 provides specific examples where the conditions can be shown
to hold. Let $\alpha_{n,p}(t) = (1 - \delta^V_{2S} - \theta^V_{S,2S})\,C_{n,p}(t)$.

A-1 There exists $S < \infty$ such that $S_p \le S$ for all $p$.

A-2 There exists $m > 0$ such that $p^m\omega_p$ is bounded, that is, $\omega_p \le H/p^m$
for some $H < \infty$.

A-3 For a given $t$, there exists $b_t$ such that $p^{-b_t}\alpha_{n,p}(t)$ is bounded for all
$n$ and $p$.

A-4 There exists $c$ such that $p^{-c}\sup_t \alpha_{n,p}(t)$ is bounded for all $n$ and $p$.

A-5 There exists a $p^*$ such that $\delta^{V_{p^*}}_{2S} + \theta^{V_{p^*}}_{S,2S}$ is bounded away from one
for large enough $n$.

A-6 $\delta^{V_{p_n}}_{2S} + \theta^{V_{p_n}}_{S,2S}$ is bounded away from one for large enough $n$, where
$n \to \infty$, $p_n \to \infty$ and $p_n/n \to 0$.

A-1 states that the number of changes in the derivative of $\beta(t)$ is bounded.
A-2 assumes that the bias in our estimate for $\beta(t)$ converges to zero at the
rate of $p^{-m}$, for some $m > 0$. A-3 requires that $\alpha_{n,p}(t)$ grows no faster than
$p^{b_t}$. A-4 is simply a stronger form of A-3. A-5 and A-6 both ensure that
the design matrix is close enough to orthogonal for (13) to hold and hence
impose a form of identifiability on $\beta(t)$. For the following two theorems we
assume that the conditions in Theorem 1 hold, $\lambda$ is set to
$\sigma_1\sqrt{2(1+\phi)\log p} + M\omega_p\sqrt{n}$ and $\varepsilon_i \sim N(0, \sigma_1^2)$.

Theorem 3. Suppose A-1 through A-5 all hold and we fix $p = p^*$. Then,
with arbitrarily high probability, as $n \to \infty$,

$|\hat\beta_n(t) - \beta(t)| \le O(n^{-1/2}) + E_n(t)$

and

$\sup_t |\hat\beta_n(t) - \beta(t)| \le O(n^{-1/2}) + \sup_t E_n(t),$

where $E_n(t) = \frac{H}{p^{*m}}\bigl\{1 + C_{n,p^*}(t)M\sqrt{S}\bigr\}$.

More specifically, the probability referred to in Theorem 3 converges to one
as $\phi \to \infty$. Theorem 3 states that, with our weakest set of assumptions, fixing
$p$ as $n \to \infty$ will cause the FLiRTI estimate to be asymptotically within $E_n(t)$
of the true $\beta(t)$. $E_n(t)$ represents the bias in the approximation caused by
representing $\beta(t)$ using a $p$ dimensional basis.

Theorem 4. Suppose we replace A-5 with A-6. Then, provided $b_t$ and
$c$ are less than $m$, if we let $p$ grow at the rate of $n^{1/(2m)}$,

$|\hat\beta_n(t) - \beta(t)| = O\bigl(\sqrt{\log n}\,/\,n^{1/2 - b_t/(2m)}\bigr)$

and

$\sup_t |\hat\beta_n(t) - \beta(t)| = O\bigl(\sqrt{\log n}\,/\,n^{1/2 - c/(2m)}\bigr).$

Theorem 4 shows that, assuming A-6 holds, $\hat\beta_n(t)$ will converge to $\beta(t)$ at
the given convergence rate. With additional assumptions, stronger results
are possible. In the Appendix we present Theorems 5 and 6, which provide
faster rates of convergence under the additional assumption that $\varepsilon_i^*$ has a
mean zero Gaussian distribution.

Theorems 3-6 make use of assumptions A-1 to A-6. Whether these assumptions
hold in practice depends on the choice of basis function and $A$
matrix. Corollary 1 below provides one specific example where conditions
A-1 to A-4 can be shown to hold.



Corollary 1. Suppose we divide the time interval $[0, 1]$ into $p$ equal
regions and use the piecewise constant basis. Let $A$ be the second difference
matrix given by (23). Suppose that $X_i(t)$ is bounded above zero for all $i$
and $t$. Then, provided $\beta'(t)$ is bounded and $\beta''(t) \neq 0$ at a finite number of
points, A-1, A-2 and A-3 all hold with $m = 1$, $b_0 = 0$ and $b_t = 0.5$, $0 < t < 1$.
In addition, for $t$ bounded away from one, A-4 will also hold with $c = 0.5$.
Hence, if A-5 holds and $\varepsilon_i^* \sim N(0, \sigma_2^2)$,

$|\hat\beta_n(t) - \beta(t)| \le O(n^{-1/2}) + \frac{H}{p^*}$ and $\sup_t |\hat\beta_n(t) - \beta(t)| \le O(n^{-1/2}) + \frac{H}{p^*}.$

Alternatively, if A-6 holds and $\varepsilon_i^* \sim N(0, \sigma_2^2)$,

$|\hat\beta_n(t) - \beta(t)| = \begin{cases} O\bigl(\sqrt{\log n}/n^{1/2}\bigr), & t = 0, \\ O\bigl(\sqrt{\log n}/n^{1/3}\bigr), & 0 < t < 1, \end{cases}$

and

$\sup_{0 < t < 1-a} |\hat\beta_n(t) - \beta(t)| = O\bigl(\sqrt{\log n}/n^{1/3}\bigr)$

for any $a > 0$. Similar, though slightly weaker, results hold when $\varepsilon_i^*$ does not
have a Gaussian distribution.



Note that the choice of $t = 0$ for the faster rate of convergence is simply made 
for notational convenience. By appropriately choosing A we can achieve this 
rate for any fixed value of t or indeed for any finite set of time points. In 
addition, the choice of the piecewise constant basis was made for simplicity. 
Similar results can be derived for higher-order polynomial bases in which 
case A-2 will hold with a higher m and hence faster rates of convergence 
will be possible. 
4. Extensions. In this section we discuss two useful extensions of the 
basic FLiRTI methodology. First, in Section 4.1, we show how to control 
multiple derivatives simultaneously to allow more flexibility in the types of 
possible shapes one can produce. Second, in Section 4.2, we extend FLiRTI 
to GLM models. 

4.1. Controlling multiple derivatives. So far we have concentrated on
controlling a single derivative of $\beta(t)$. However, one of the most powerful
aspects of the FLiRTI approach is that we can combine constraints for multiple
derivatives together to produce curves with many different properties.
For example, one may believe that both $\beta^{(0)}(t) = 0$ and $\beta^{(2)}(t) = 0$ over
many regions of $t$, that is, $\beta(t)$ is exactly zero over certain regions and $\beta(t)$
is exactly linear over other regions of $t$. In this situation, we would let

(15)  $A = [D^0B(t_1), D^0B(t_2), \ldots, D^0B(t_p), D^2B(t_1), D^2B(t_2), \ldots, D^2B(t_p)]^T.$

In general, such a matrix will have more rows than columns so will not be
invertible. Let $A_{(1)}$ represent the first $p$ rows of $A$ and $A_{(2)}$ the remainder.
Similarly, let $\gamma_{(1)}$ represent the first $p$ elements of $\gamma$ and $\gamma_{(2)}$ the remaining
elements. Then, assuming $A$ is arranged so that $A_{(1)}$ is invertible, the
constraint $\gamma = A\eta$ implies

(16)  $\eta = A_{(1)}^{-1}\gamma_{(1)}$ and $\gamma_{(2)} = A_{(2)}A_{(1)}^{-1}\gamma_{(1)}.$

Hence, (3) can be expressed as

(17)  $Y_i = \beta_0 + (A_{(1)}^{-T}\mathbf{X}_i)^T\gamma_{(1)} + \varepsilon_i^*, \qquad i = 1, \ldots, n.$

We then use this model to estimate $\gamma$ subject to the constraint given by
(16). We achieve this by implementing the Dantzig selector or lasso in a
similar fashion to that in Section 2 except that we replace the old design
matrix with $V_{(1)} = [\mathbf{1} \mid XA_{(1)}^{-1}]$ and (16) is enforced in addition to the usual
constraints.

Finally, when constraining multiple derivatives one may well not wish to
place equal weight on each derivative. For example, for the $A$ given by (15),
we may wish to place a greater emphasis on sparsity in the second derivative
than in the zeroth, or vice versa. Hence, instead of simply minimizing the
$L_1$ norm of $\gamma$ we minimize $\|\Omega\gamma\|_1$, where $\Omega$ is a diagonal weighting matrix.
In theory a different weight could be chosen for each $\gamma_j$ but in practice this
would not be feasible. Instead, for an $A$ such as (15), we place a weight of
one on the second derivatives and select a single weight, $\omega$, chosen via
cross-validation, for the zeroth derivatives. This approach provides flexibility while
still being computationally feasible and has worked well on all the problems
we have examined. The FLiRTI fits in Figures 1-3 were produced using this
multiple derivative methodology.
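The bookkeeping for two derivatives can be illustrated with a small sketch. It assumes the grid basis with $B(t_j)$ equal to the $j$th unit vector, so that the zeroth-derivative block $A_{(1)}$ is the identity, and it drops difference terms that fall before the start of the grid; both are assumptions of the sketch rather than statements from the paper. Given $\eta$, it returns $\gamma_{(1)}$, the implied $\gamma_{(2)} = A_{(2)}A_{(1)}^{-1}\gamma_{(1)}$, and the weighted penalty $\|\Omega\gamma\|_1$ with weight $\omega$ on the zeroth derivative and one on the $d$th. The function names are hypothetical.

```python
import numpy as np

def stacked_blocks(p, d):
    """Blocks A_(1) (zeroth derivative) and A_(2) (dth difference) for the grid basis."""
    first_diff = p * (np.eye(p) - np.eye(p, k=-1))  # rows p * (e_j - e_{j-1})
    A1 = np.eye(p)                                   # D^0 B(t_j) = e_j
    A2 = np.linalg.matrix_power(first_diff, d)       # D^d B(t_j)
    return A1, A2

def implied_gamma(eta, A1, A2, omega):
    """gamma_(1), the gamma_(2) implied by the constraint (16), and ||Omega gamma||_1."""
    gamma1 = A1 @ eta
    gamma2 = A2 @ np.linalg.solve(A1, gamma1)
    penalty = omega * np.abs(gamma1).sum() + np.abs(gamma2).sum()
    return gamma1, gamma2, penalty
```

A coefficient vector that is zero over the first half of the grid and linear afterwards has only a few non-zero entries in $\gamma_{(1)}$ and a single non-zero entry in $\gamma_{(2)}$, so the weighted $L_1$ penalty favours exactly the "zero, then linear" shapes described above.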
4.2. FLiRTI for functional generalized linear models. The FLiRTI model
can easily be adapted to GLM data where the response is no longer assumed
to be Gaussian. James and Radchenko (2009) demonstrate that the Dantzig
selector can be naturally extended to the GLM domain by optimizing

(18)  $\min_\beta \|\beta\|_1$ subject to $|X_j^T(Y - \mu)| \le \lambda, \quad j = 1, \ldots, p,$

where $\mu = g^{-1}(X\beta)$ and $g$ is the canonical link function. Optimizing (18)
no longer involves linear constraints but it can be solved using an iterative
weighted linear programming approach. In particular, let $U_i$ be the conditional
variance of $Y_i$ given the current estimate $\hat\beta$ and let
$Z_i = \sum_{j=1}^p X_{ij}\hat\beta_j + (Y_i - \hat\mu_i)/U_i$. Then one can apply the standard Dantzig selector, using
$Z_i^* = Z_i\sqrt{U_i}$ as the response and $X_{ij}^* = X_{ij}\sqrt{U_i}$ as the predictor, to produce a new
estimate for $\beta$. James and Radchenko (2009) show that iteratively applying
the Dantzig selector in this fashion until convergence generally does a good
job of solving (18). We apply the same approach, except that we iteratively
apply the modified version of the Dantzig selector, outlined in Section 4.1, to
the transformed response and predictor variables, using $V_{(1)} = [\mathbf{1} \mid XA_{(1)}^{-1}]$ as
the design matrix. James and Radchenko (2009) also suggest an algorithm
for approximating the GLM Dantzig selector coefficient path for different
values of $\lambda$. With minor modifications, this algorithm can also be used to
construct the GLM FLiRTI coefficient path.
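One step of the reweighting described above can be sketched for a concrete case. The example assumes a Bernoulli response with the canonical logit link, so that $\hat\mu_i = 1/(1 + e^{-X_i^T\hat\beta})$ and $U_i = \hat\mu_i(1 - \hat\mu_i)$; the choice of family and the function name `irls_transform` are assumptions made for illustration. The returned pair is what would be handed to a standard (or modified) Dantzig selector at each iteration.

```python
import numpy as np

def irls_transform(X, y, beta):
    """Transformed response Z* and predictors X* for one iteration (logistic case).

    Z_i  = sum_j X_ij beta_j + (Y_i - mu_i) / U_i
    Z*_i = Z_i sqrt(U_i),  X*_ij = X_ij sqrt(U_i)
    """
    linear = X @ beta
    mu = 1.0 / (1.0 + np.exp(-linear))  # inverse of the canonical logit link
    U = mu * (1.0 - mu)                 # conditional variance of a Bernoulli response
    Z = linear + (y - mu) / U
    root_u = np.sqrt(U)
    return Z * root_u, X * root_u[:, None]
```

Starting from $\hat\beta = 0$ every observation receives the same weight, and later iterations downweight observations whose fitted probabilities are close to zero or one.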

4.3. Model selection. To fit FLiRTI one must select values for three different
tuning parameters: $\lambda$, $\omega$ and the derivative to assume sparsity in, $d$.
The choice of $\lambda$ and $d$ in the FLiRTI setting is analogous to the choice of
the tuning parameters in a standard smoothing situation. In the smoothing
situation one observes $n$ pairs of $(x_i, y_i)$ and chooses $g(t)$ to minimize

(19)  $\sum_{i=1}^n (y_i - g(x_i))^2 + \lambda \int g^{(d)}(t)^2\,dt.$

In this case the second term, which controls the smoothness of the curve,
involves two tuning parameters, namely $\lambda$ and the derivative, $d$. The choice
of $d$ in the smoothing situation is completely analogous to the choice of
the derivative in FLiRTI. As with FLiRTI, different choices will produce
different shapes. A significant majority of the time $d$ is set to 2, resulting in
a cubic spline. If the data is used to select $d$ then the most common approach
is to choose the values of $\lambda$ and $d$ that produce the lowest cross-validated
residual sum of squares.

We adopt the latter approach with FLiRTI. In particular we implement
FLiRTI using two derivatives, the zeroth and one additional derivative, $d$, with $d$
typically chosen from the values $d = 1, 2, 3, 4$. We then compute the
cross-validated residual sum of squares for $d = 1, 2, 3, 4$ and a grid of values for
$\lambda$ and $\omega$. The final tuning parameters are those corresponding to the lowest
cross-validated value. Even though this approach involves three tuning
parameters, it is still computationally feasible because there are only a few
values of $d$ to test out and, in practice, the results are relatively insensitive
to the exact value of $\omega$ so only a few grid points need to be considered.
Additional computational savings are produced if one sets $\omega$ to zero. This has
the effect of reducing the number of tuning parameters to two by restricting
FLiRTI to assume sparsity in only one derivative. We explore this option in
the simulation section below and show that this restriction can often still
produce good results.
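The grid search over $(d, \lambda, \omega)$ can be organized as in the following generic sketch. It is not the authors' code: `fit` and `predict` are user-supplied callables with a hypothetical interface, the fold assignment is a simple random split, and the criterion is the cross-validated residual sum of squares described above.

```python
import itertools
import numpy as np

def cv_select(fit, predict, X, y, grid, n_folds=5, seed=0):
    """Return the parameter combination with the lowest cross-validated RSS.

    fit(X_train, y_train, **params) -> model and predict(model, X_test) -> y_hat
    are supplied by the caller (hypothetical interface for this sketch).
    grid maps parameter names, e.g. "d", "lam", "omega", to candidate values.
    """
    folds = np.array_split(np.random.default_rng(seed).permutation(len(y)), n_folds)
    best_params, best_rss = None, np.inf
    for values in itertools.product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        rss = 0.0
        for k in range(n_folds):
            test = folds[k]
            train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
            model = fit(X[train], y[train], **params)
            rss += np.sum((y[test] - predict(model, X[test])) ** 2)
        if rss < best_rss:
            best_params, best_rss = params, rss
    return best_params, best_rss
```

Because only a handful of values of $d$ and $\omega$ are considered, the cost is dominated by the grid over $\lambda$, which a path algorithm can cover in a single fit per fold.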

5. Simulation study. In this section, we use a comprehensive simulation
study to demonstrate four versions of the FLiRTI method, and compare the
results with the basis approach using B-spline bases. The first two versions of
FLiRTI, "FLiRTI$_L$" and "FLiRTI$_D$," respectively use the lasso and Dantzig
methods assuming sparsity in the zeroth and one other derivative. The second
two versions, "FLiRTI1$_L$" and "FLiRTI1$_D$," do not assume sparsity in
the zeroth derivative but are otherwise identical to the first two implementations.
We consider three cases. The details of the simulation models are
as follows.

• Case I: $\beta(t) = 0$ (no signal).

• Case II: $\beta(t)$ is piecewise quadratic with a "flat" region (see Figure 1).
Specifically,

$\beta(t) = \begin{cases} (t - 0.5)^2 - 0.025, & \text{if } 0 \le t < 0.342, \\ 0, & \text{if } 0.342 \le t \le 0.658, \\ -(t - 0.5)^2 + 0.025, & \text{if } 0.658 < t \le 1. \end{cases}$

• Case III: $\beta(t)$ is a cubic curve (see Figure 3), that is,

$\beta(t) = t^3 - 1.6t^2 + 0.76t + 1, \qquad 0 \le t \le 1.$

This is a model where one might not expect FLiRTI to have any advantage
over the B-spline method.

In each case, we consider three different types of $X(t)$.

• Polynomials: $X(t) = a_0 + a_1 t + a_2 t^2 + a_3 t^3$, $0 \le t \le 1$.

• Fourier: $X(t) = a_0 + a_1\sin(2\pi t) + a_2\cos(2\pi t) + a_3\sin(4\pi t) + a_4\cos(4\pi t)$, $0 \le t \le 1$.

• B-splines: $X(t)$ is a linear combination of cubic B-splines, with knots at
$1/7, \ldots, 6/7$.

The coefficients in $X(t)$ are generated from the standard normal distribution.
The error term $\varepsilon$ in (1) follows a normal distribution
$N(0, \sigma^2)$, where $\sigma^2$ is set equal to 1 for the first case and to
appropriate values for the other cases such that each of the signal to noise
ratios, $\mathrm{Var}(f(X))/\mathrm{Var}(\varepsilon)$, is equal to 4.
We generate $n = 200$ training observations from each of the above models,
along with 10,000 test observations.

Table 1

The columns are for different methods. The rows are for different $X(t)$. The numbers
outside the parentheses are the average MSEs over 100 repetitions, and the numbers
inside the parentheses are the corresponding standard errors

                  B-spline        FLiRTI$_L$      FLiRTI$_D$      FLiRTI$_{1L}$   FLiRTI$_{1D}$

Case I (×10^…)
  Polynomial      2.10 (0.14)     0.99 (0.14)     0.38 (0.09)     1.5 (0.16)      0.52 (0.11)
  Fourier         1.90 (0.18)     0.53 (0.09)     0.47 (0.11)     1.40 (0.15)     0.50 (0.10)
  B-spline        2.40 (0.32)     0.82 (0.14)     0.45 (0.11)     1.40 (0.22)     0.57 (0.14)

Case II (×10^…)
  Polynomial      1.20 (0.12)     0.85 (0.09)     0.72 (0.09)     0.92 (0.09)     0.92 (0.08)
  Fourier         3.90 (0.32)     3.40 (0.27)     3.30 (0.29)     3.50 (0.30)     3.60 (0.28)
  B-spline        0.52 (0.03)     0.44 (0.03)     0.37 (0.03)     0.43 (0.03)     0.46 (0.03)

Case III (×10^…)
  Polynomial      0.96 (0.11)     0.60 (0.07)     0.74 (0.08)     0.57 (0.08)     0.66 (0.08)
  Fourier         0.79 (0.11)     0.43 (0.05)     0.62 (0.06)     0.44 (0.05)     0.46 (0.06)
  B-spline        0.080 (0.007)   0.066 (0.007)   0.074 (0.008)   0.063 (0.005)   0.070 (0.008)
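The simulation design can be sketched as follows. This is an illustrative sketch rather than the authors' code: the integral $\int \beta(t)X(t)\,dt$ is approximated by a Riemann sum on a midpoint grid, the intercept is taken to be zero, and $\sigma$ is chosen from the sample variance of the signal so that the signal to noise ratio equals the target. The function names, the grid size and the B-spline case being omitted are all assumptions of this sketch.

```python
import numpy as np

def beta_case2(t):
    """Case II: piecewise quadratic with a flat (zero) region on [0.342, 0.658]."""
    t = np.asarray(t, dtype=float)
    left = (t - 0.5) ** 2 - 0.025
    right = -(t - 0.5) ** 2 + 0.025
    return np.where(t < 0.342, left, np.where(t > 0.658, right, 0.0))

def beta_case3(t):
    """Case III: cubic coefficient curve."""
    t = np.asarray(t, dtype=float)
    return t ** 3 - 1.6 * t ** 2 + 0.76 * t + 1.0

def simulate(beta_fn, n, kind="polynomial", snr=4.0, grid=200, seed=0):
    """Draw n curves X_i(t) and responses Y_i = int beta(t) X_i(t) dt + noise."""
    rng = np.random.default_rng(seed)
    t = (np.arange(grid) + 0.5) / grid            # midpoint grid on [0, 1]
    if kind == "polynomial":
        basis = np.vstack([t ** k for k in range(4)])
    elif kind == "fourier":
        basis = np.vstack([np.ones_like(t),
                           np.sin(2 * np.pi * t), np.cos(2 * np.pi * t),
                           np.sin(4 * np.pi * t), np.cos(4 * np.pi * t)])
    else:
        raise ValueError("kind must be 'polynomial' or 'fourier'")
    coef = rng.standard_normal((n, basis.shape[0]))   # standard normal coefficients
    X = coef @ basis
    signal = X @ beta_fn(t) / grid                # Riemann sum for the integral
    # sigma = 1 when there is no signal (Case I), else match the target SNR
    sigma = 1.0 if signal.var() == 0 else np.sqrt(signal.var() / snr)
    y = signal + sigma * rng.standard_normal(n)
    return t, X, y, sigma
```

For example, `simulate(beta_case2, n=200, kind="fourier")` would give one training set for Case II with Fourier predictors; a validation or test set is obtained by calling it again with a different seed and the same $\sigma$ in mind.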

As discussed in Section 4.3, fitting FLiRTI requires choosing three tuning
parameters, $\lambda$, $\omega$ and $d$, the second derivative to penalize. For
the B-spline method, the tuning parameters include the order of the B-spline
and the number of knots (the knots are evenly spaced between 0 and 1).
To ensure a fair comparison between the two methods, for each training
data set, we generate a separate validation data set also containing 200 
observations. The validation set is then used to select tuning parameters 
that minimize the validation error. Using the selected tuning parameters, 
we calculate the mean squared error (MSE) on the test set. We repeat this 
100 times and compute the average MSEs and their corresponding standard 
errors. The results are summarized in Table 1. 

As we can see, in terms of prediction accuracy, the FLiRTI methods perform
consistently better than the B-spline method. Since the FLiRTI$_1$ methods do
not search for "flat" regions, their results deteriorate somewhat for Cases I
and II relative to standard FLiRTI; correspondingly, their results improve
slightly in Case III. However, in all cases all four versions of FLiRTI
outperform the B-spline method. This is particularly interesting for Case III.
Both FLiRTI and the B-spline method can potentially model Case III exactly but
only if the correct value for $d$ is chosen in FLiRTI and the correct order
in the B-spline. Since these tuning parameters are chosen automatically using
a separate validation set, neither method has an obvious advantage, yet
FLiRTI still outperforms the B-spline approach. The lasso and the Dantzig
selector implementations of FLiRTI perform similarly, with the Dantzig
selector having an edge in the first case, while lasso has a slight advantage in
the third case. Finally, for Cases I and II, we also computed the fraction of
the zero regions that FLiRTI correctly identified. The results are presented
in Table 2.

Table 2

The table contains the percentage of the true zero region that was correctly
identified. The rows are for different $X(t)$. The numbers outside the
parentheses are the averages over 100 repetitions, and the numbers inside the
parentheses are the corresponding standard errors

                  FLiRTI$_L$      FLiRTI$_D$

Case I
  Polynomial      61% (6%)        65% (6%)
  Fourier         91% (2%)        71% (6%)
  B-spline        54% (6%)        74% (6%)

Case II
  Polynomial      70% (6%)        70% (6%)
  Fourier         72% (5%)        72% (5%)
  B-spline        58% (5%)        59% (5%)

6. Canadian weather data. In this section we demonstrate the FLiRTI 
approach on a classic functional linear regression data set. The data con- 
sisted of one year of daily temperature measurements from each of 35 Cana- 
dian weather stations. Figure 4(a) illustrates the curves for 9 randomly se- 
lected stations. We also observed the total annual rainfall, on the log scale, 
at each weather station. The aim was to use the temperature curves to pre- 
dict annual rainfall at each location. In particular, we were interested in 
identifying the times of the year that have an effect on rainfall. Previous 
research suggested that temperatures in the summer months may have little 
or no relationship to rainfall whereas temperatures at other times do have an 
effect. Figure 4(b) provides an estimate for $\beta(t)$ achieved using the B-spline
basis approach outlined in the previous section. In this case we restricted
the values at the start and the end of the year to be equal and chose the
dimension, $q = 4$, using cross-validation. The curve suggests a positive
relationship between temperature and rainfall in the fall months and a negative
relationship in the spring. There also appears to be little relationship during
the summer months. However, because of the restricted functional form of
the curve, there are only two points where $\beta(t) = 0$.

The corresponding estimate from the FLiRTI approach, after dividing
the yearly data into a grid of 100 equally spaced points and restricting the
zeroth and third derivatives, is presented in Figure 4(c) (black line) with
the spline estimate in grey. The choices of $\lambda$ and $\omega$ were made
using tenfold cross-validation. The FLiRTI estimate also indicates a negative
relationship in the spring and a positive relationship in the late fall but no
relationship in the summer and winter months. In comparing the B-spline and
FLiRTI fits, both are potentially reasonable. The B-spline fit suggests a
possible cos/sin relationship, which seems sensible given that the climate
pattern is usually seasonal. Alternatively, the FLiRTI fit produces a simple
and easily interpretable result. In this example, the FLiRTI estimate seemed to
be slightly more accurate, with a tenfold cross-validated sum of squared errors
of 4.77 versus 5.70 for the B-spline approach.

Fig. 4. (a) Smoothed daily temperature curves for 9 of 35 Canadian weather stations. (b)
Estimated beta curve using a natural cubic spline. (c) Estimated beta curve using FLiRTI
approach (black) with cubic spline estimate (grey).

In addition to estimates for $\beta(t)$, one can also easily generate confidence
intervals and tests of significance. We illustrate these ideas in Figure 5.
Pointwise confidence intervals on $\beta(t)$ can be produced by bootstrapping
the pairs of observations $\{Y_i, X_i(t)\}$, reestimating $\beta(t)$ and then
taking the appropriate empirical quantiles from the estimated curves at each
time point. Figures 5(a) and (b) illustrate the estimates from restricting the
first and third derivatives, respectively, along with the corresponding 95%
confidence intervals. In both cases the confidence intervals confirm the
statistical significance of the positive relationship in the fall months. The
significance of the negative relationship in the spring months is less clear
since the upper bound is at zero. However, this is somewhat misleading because
approximately 96% of the bootstrap curves did include a dip in the spring but,
because the dips occurred at slightly different times, their effect canceled
out to some extent. Some form of curve registration may be appropriate
but we will not explore that here. Note that the bootstrap estimates also
consistently estimate zero relationship during the summer months, providing
further evidence that there is little effect from temperature in this period.
Finally, Figure 5(c) illustrates a permutation test we developed for testing
statistical significance of the relationship between temperature and rainfall.
The grey line indicates the value of $R^2$ (0.73) for the FLiRTI method applied
to the weather data. We then permuted the response variable 500 times and
for each permutation computed the new $R^2$. All 500 permuted $R^2$'s were
well below 0.73, providing very strong evidence of a true relationship.

Fig. 5. (a) Estimated beta curve from constraining the zeroth and first derivatives. (b)
Estimated beta curve from constraining the zeroth and third derivatives. The dashed lines
represent 95% confidence intervals. (c) $R^2$ from permuting the response variable 500 times.
The grey line represents the observed $R^2$ from the true data.
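The bootstrap intervals and the permutation test described above do not depend on how $\beta(t)$ is estimated, so they can be sketched around any fitting routine. In the sketch below `ridge_beta` is a simple ridge stand-in for the estimator (it is not FLiRTI), curves are assumed to be observed on a common grid of $p$ points, and all function names are hypothetical.

```python
import numpy as np

def _center(X, y):
    return X - X.mean(axis=0), y - y.mean()

def ridge_beta(X, y, alpha=1e-2):
    """Stand-in estimator of beta(t) on the grid (ridge regression, NOT FLiRTI)."""
    Xc, yc = _center(X, y)
    p = X.shape[1]
    Z = Xc / p                                   # Riemann-sum weights
    return np.linalg.solve(Z.T @ Z + alpha * np.eye(p), Z.T @ yc)

def r_squared(X, y, beta):
    """In-sample R^2 of the fitted functional linear model."""
    Xc, yc = _center(X, y)
    resid = yc - (Xc / X.shape[1]) @ beta
    return 1.0 - (resid @ resid) / (yc @ yc)

def bootstrap_ci(X, y, fit, n_boot=200, level=0.95, seed=0):
    """Pointwise percentile intervals from resampling the pairs (Y_i, X_i(t))."""
    rng = np.random.default_rng(seed)
    n = len(y)
    curves = np.empty((n_boot, X.shape[1]))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)         # sample pairs with replacement
        curves[b] = fit(X[idx], y[idx])
    lo, hi = np.quantile(curves, [(1 - level) / 2, (1 + level) / 2], axis=0)
    return lo, hi

def permutation_r2(X, y, fit, n_perm=500, seed=0):
    """Observed R^2 and the R^2 values obtained after permuting the response."""
    rng = np.random.default_rng(seed)
    observed = r_squared(X, y, fit(X, y))
    permuted = np.empty(n_perm)
    for b in range(n_perm):
        y_perm = rng.permutation(y)              # breaks any X-Y relationship
        permuted[b] = r_squared(X, y_perm, fit(X, y_perm))
    return observed, permuted
```

An empirical p-value can then be taken as the fraction of permuted $R^2$ values that are at least as large as the observed one.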

7. Discussion. The approach presented in this paper takes a departure 
from the standard regression paradigm, where one generally attempts to 
minimize an L2 quantity, such as the sum of squared errors, subject to an 
additional penalty term. Instead we attempt to find the sparsest solution, 
in terms of various derivatives of $\beta(t)$, subject to the solution providing
a reasonable fit to the data. By directly searching for sparse solutions we
are able to produce estimates that have far simpler structure than those from
traditional methods while still maintaining the flexibility to generate smooth
coefficient curves when required or desired. The exact shape of the curve is
governed by the choice of derivatives to constrain, which is analogous to the
choice of the derivative to penalize in a traditional smoothing spline. The
final choice of derivatives can be made either on subjective grounds, such as
the tradeoff between interpretability and smoothness, or using an objective
criterion, such as the derivative producing the lowest cross-validated error.
The theoretical bounds derived in Section 3, which show the error rate can
grow as slowly as $\sqrt{\log p}$, as well as the empirical results, suggest
that one can choose an extremely flexible basis, in terms of a large value for
$p$, without sacrificing prediction accuracy.

There has been some previous work along these lines. For example,
Tibshirani et al. (2005) use an $L_1$ lasso-type penalty on both the zeroth
and first derivatives of a set of coefficients to produce an estimate which is
both exactly zero over some regions and exactly constant over other regions.
Valdes-Sosa et al. (2005) also use a combination of both $L_1$ and $L_2$
penalties on fMRI data. Probably the work closest to ours is a recent approach by
Lu and Zhang (2008), called the "functional smooth lasso" (FSL), that was
completed at the same time as FLiRTI. The FSL uses a lasso-type approach
by placing an $L_1$ penalty on the zeroth derivative and an $L_2$ penalty on the
second derivative. This is a nice approach and, as with FLiRTI, produces
regions where $\beta(t)$ is exactly zero. However, our approach can be
differentiated from these other methods in that we consider derivatives of
different order so FLiRTI can generate piecewise constant, linear and quadratic
sections. In addition, FLiRTI possesses interesting theoretical properties in
terms of the nonasymptotic bounds.



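APPENDIX A: PROOF OF THEOREM 1

We first present definitions of $\delta$, $\theta$ and $C_{n,p}(t)$. The
definitions of $\delta$ and $\theta$ were first introduced in Candes and Tao
(2005).

Definition 1. Let $X$ be an $n$ by $p$ matrix and let $X_T$, $T \subset \{1, \ldots, p\}$,
be the $n$ by $|T|$ submatrix obtained by standardizing the columns of $X$ and
extracting those corresponding to the indices in $T$. Then we define $\delta^X_S$ as
the smallest quantity such that
$(1 - \delta^X_S)\|c\|_2^2 \le \|X_T c\|_2^2 \le (1 + \delta^X_S)\|c\|_2^2$ for
all subsets $T$ with $|T| \le S$ and all vectors $c$ of length $|T|$.

Definition 2. Let $T$ and $T'$ be two disjoint sets with $T, T' \subset \{1, \ldots, p\}$,
$|T| \le S$ and $|T'| \le S'$. Then, provided $S + S' \le p$, we define
$\theta^X_{S,S'}$ as the smallest quantity such that
$|(X_T c)^T X_{T'} c'| \le \theta^X_{S,S'}\|c\|_2\|c'\|_2$ for all $T$ and $T'$
and all corresponding vectors $c$ and $c'$.

Finally, let

$$C_{n,p}(t) = \frac{4\alpha_{n,p}(t)}{1 - \delta^V_{2S} - \theta^V_{S,2S}}, \qquad \text{where} \qquad \alpha_{n,p}(t) = \sqrt{n}\,\|B_p(t)^T A^{-1} D_v^{-1}\|$$

and $D_v$ is the diagonal matrix of column norms of $V$ used below.

Next, we state a lemma which is a direct consequence of Theorem 2 in James,
Radchenko and Lv (2009) and Theorem 1.1 in Candes and Tao (2007). The
lemma is utilized in the proof of Theorem 1.

Lemma 1. Let $Y = X\tilde\gamma + \varepsilon$, where $X$ has norm one columns.
Suppose that $\tilde\gamma$ is an $S$-sparse vector with
$\delta^X_{2S} + \theta^X_{S,2S} < 1$. Let $\hat{\tilde\gamma}$ be the
corresponding solution from the Dantzig selector or the lasso. Then

$$\|\hat{\tilde\gamma} - \tilde\gamma\| \le \frac{4\lambda\sqrt{S}}{1 - \delta^X_{2S} - \theta^X_{S,2S}}$$

provided that (10) and $\max_j |X_j^T\varepsilon| \le \lambda$ both hold.

Lemma 1 extends Theorem 1.1 in Candes and Tao (2007), which deals
only with the Dantzig selector, to the lasso. Now we provide the proof of
Theorem 1. First note that the functional linear regression model given by
(6) can be reexpressed as

$$Y = V\gamma + \varepsilon^* = \tilde V\tilde\gamma + \varepsilon^*, \tag{20}$$

where $\tilde\gamma = D_v\gamma$ and $D_v$ is a diagonal matrix consisting of the
column norms of $V$ (so that $\tilde V = V D_v^{-1}$ has norm one columns).
Hence, by Lemma 1,

$$\|D_v\hat\gamma - D_v\gamma\| = \|\hat{\tilde\gamma} - \tilde\gamma\| \le \frac{4\lambda\sqrt{S}}{1 - \delta^V_{2S} - \theta^V_{S,2S}}$$

provided (11) holds.

But $\hat\beta(t) = B_p(t)^T A^{-1}\hat\gamma = B_p(t)^T A^{-1} D_v^{-1}\hat{\tilde\gamma}$,
while $\beta(t) = B_p(t)^T\eta + e_p(t) = B_p(t)^T A^{-1} D_v^{-1}\tilde\gamma + e_p(t)$.
Then

$$\begin{aligned}
|\hat\beta(t) - \beta(t)| &\le |\hat\beta(t) - B_p(t)^T\eta| + |e_p(t)| \\
&= |B_p(t)^T A^{-1} D_v^{-1}(\hat{\tilde\gamma} - \tilde\gamma)| + |e_p(t)| \\
&\le \|B_p(t)^T A^{-1} D_v^{-1}\| \cdot \|\hat{\tilde\gamma} - \tilde\gamma\| + \omega_p \\
&= \frac{1}{\sqrt{n}}\,\alpha_{n,p}(t)\,\|\hat{\tilde\gamma} - \tilde\gamma\| + \omega_p \\
&\le \frac{1}{\sqrt{n}}\,\frac{4\alpha_{n,p}(t)\lambda\sqrt{S}}{1 - \delta^V_{2S} - \theta^V_{S,2S}} + \omega_p.
\end{aligned}$$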

APPENDIX B: PROOF OF THEOREM 2 

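Substituting $\lambda = \sigma_1\sqrt{2(1+\phi)\log p} + M\omega_p\sqrt{n}$ into
(12) gives (13). Let $\varepsilon'_i = \int X_i(t)e_p(t)\,dt$. Then to show that
(11) holds with the appropriate probability note that

$$\begin{aligned}
|V_j^T\varepsilon^*| &= |V_j^T\varepsilon + V_j^T\varepsilon'| \le |V_j^T\varepsilon| + |V_j^T\varepsilon'| \\
&= \sigma_1|Z_j| + |V_j^T\varepsilon'| \le \sigma_1|Z_j| + M\omega_p\sqrt{n},
\end{aligned}$$

where $Z_j \sim N(0,1)$. This result follows from the fact that $V_j$ is norm one
and, since $\varepsilon_i \sim N(0,\sigma_1^2)$, it will be the case that
$V_j^T\varepsilon \sim N(0,\sigma_1^2)$. Hence

$$\begin{aligned}
P\Big(\max_j|V_j^T\varepsilon^*| > \lambda\Big) &= P\Big(\max_j|V_j^T\varepsilon^*| > \sigma_1\sqrt{2(1+\phi)\log p} + M\omega_p\sqrt{n}\Big) \\
&\le P\Big(\max_j|Z_j| > \sqrt{2(1+\phi)\log p}\Big) \\
&\le \frac{p}{\sqrt{2\pi}}\exp\{-(1+\phi)\log p\}\Big/\sqrt{2(1+\phi)\log p} \\
&= \Big(p^{\phi}\sqrt{4(1+\phi)\pi\log p}\Big)^{-1}.
\end{aligned}$$

The penultimate line follows from the fact that
$P(\sup_j|Z_j| > u) \le \frac{p}{u}\frac{1}{\sqrt{2\pi}}\exp(-u^2/2)$.

In the case where $\varepsilon^* \sim N(0,\sigma_2^2 I)$, substituting
$\lambda = \sigma_2\sqrt{2(1+\phi)\log p}$ into (12) gives (14). In this case
$V_j^T\varepsilon^* = \sigma_2 Z_j$, where $Z_j \sim N(0,1)$. Hence

$$P\Big(\max_j|V_j^T\varepsilon^*| > \sigma_2\sqrt{2(1+\phi)\log p}\Big) = P\Big(\max_j|Z_j| > \sqrt{2(1+\phi)\log p}\Big)$$

and the result follows in the same manner as above.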
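The last step of the probability calculation in this appendix is pure algebra: evaluating $\frac{p}{u}\frac{1}{\sqrt{2\pi}}\exp(-u^2/2)$ at $u = \sqrt{2(1+\phi)\log p}$ gives $(p^{\phi}\sqrt{4(1+\phi)\pi\log p})^{-1}$. The short check below confirms that the two expressions agree numerically; the function names are hypothetical and the snippet verifies only this simplification, not the inequality itself.

```python
import math

def penultimate_expression(p, phi):
    """(p / u) * exp(-u^2 / 2) / sqrt(2*pi) evaluated at u = sqrt(2(1+phi) log p)."""
    u = math.sqrt(2.0 * (1.0 + phi) * math.log(p))
    return p / u * math.exp(-u * u / 2.0) / math.sqrt(2.0 * math.pi)

def final_expression(p, phi):
    """(p^phi * sqrt(4 (1+phi) pi log p))^(-1), the simplified form."""
    return 1.0 / (p ** phi * math.sqrt(4.0 * (1.0 + phi) * math.pi * math.log(p)))

for p in (10, 100, 1000):
    for phi in (0.0, 0.5, 2.0):
        a, b = penultimate_expression(p, phi), final_expression(p, phi)
        print(f"p={p:5d} phi={phi:3.1f}  {a:.6e}  {b:.6e}")
```

The expression shrinks as either $p$ or $\phi$ grows, which is why a large enough $\phi$ yields the "arbitrarily high probability" statements used in the later proofs.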

APPENDIX C: THEOREMS 5 AND 6 ASSUMING GAUSSIAN e* 



The following theorems hold with $\lambda = \sigma_2\sqrt{2(1+\phi)\log p}$.

Theorem 5. Suppose A-1 through A-5 all hold and $\varepsilon^* \sim N(0,\sigma_2^2 I)$.
Then, with arbitrarily high probability,

$$|\hat\beta_n(t) - \beta(t)| \le O(n^{-1/2}) + \frac{H}{p^{*m}}$$

and

$$\sup_t|\hat\beta_n(t) - \beta(t)| \le O(n^{-1/2}) + \frac{H}{p^{*m}}.$$

Theorem 5 demonstrates that, with the additional assumption that
$\varepsilon^* \sim N(0,\sigma_2^2 I)$, asymptotically the approximation error is
now bounded by $H/p^{*m}$, which is strictly less than $E_n(t)$. Finally,
Theorem 6 allows $p$ to grow with $n$, which removes the bias term, and hence
$\hat\beta_n(t)$ becomes a consistent estimator.

Theorem 6. Suppose we assume A-6 as well as $\varepsilon^* \sim N(0,\sigma_2^2 I)$.
Then if we let $p$ grow at the rate of $n^{1/(2m+2b_t)}$, the rate of
convergence improves to

$$|\hat\beta_n(t) - \beta(t)| = O\left(\frac{\sqrt{\log n}}{n^{(1/2)(m/(m+b_t))}}\right)$$

or if we let $p$ grow at the rate of $n^{1/(2m+2c)}$, the supremum converges at a
rate of

$$\sup_t|\hat\beta_n(t) - \beta(t)| = O\left(\frac{\sqrt{\log n}}{n^{(1/2)(m/(m+c))}}\right).$$

APPENDIX D: PROOFS OF THEOREMS 3-6 



Proof of Theorem 3. By Theorem 2, for $p = p^*$,

$$|\hat\beta(t) - \beta(t)| \le \frac{1}{\sqrt{n}}C_{n,p^*}(t)\sigma_1\sqrt{2S_{p^*}(1+\phi)\log p^*} + \omega_{p^*}\big\{1 + C_{n,p^*}(t)M\sqrt{S_{p^*}}\big\}$$

with arbitrarily high probability provided $\phi$ is large enough. But, by A-1,
$S_{p^*}$ is bounded, and, by A-3 and A-5, $C_{n,p^*}(t)$ is bounded for large $n$.
Hence, since $p^*$, $\sigma_1$ and $\phi$ are fixed, the first term on the
right-hand side is $O(n^{-1/2})$. Finally, by A-2, $\omega_{p^*} \le H/p^{*m}$ and,
by A-1, $S_{p^*} \le S$ so the second term of the equation is at most $E_n(t)$.
The result for $\sup_t|\hat\beta_n(t) - \beta(t)|$ can be proved in an identical
fashion by replacing A-3 by A-4. □

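Proof of Theorem 4. By A-1 there exists $S < \infty$ such that $S_p \le S$
for all $p$. Hence, by (13), setting $\phi = 0$, with probability converging to
one as $p \to \infty$,

$$\begin{aligned}
|\hat\beta_n(t) - \beta(t)| &\le \frac{1}{\sqrt{n}}C_{n,p_n}(t)\sigma_1\sqrt{2S\log p_n} + \omega_{p_n}\big\{1 + C_{n,p_n}(t)M\sqrt{S}\big\} \\
&= \frac{\sqrt{\log n}}{n^{1/2 - b_t/(2m)}}\,K,
\end{aligned}$$

where $K$ is

$$\begin{aligned}
K = {} & \left(\frac{p_n}{n^{1/(2m)}}\right)^{b_t}\frac{4p_n^{-b_t}\alpha_{n,p_n}(t)}{1 - \delta^V_{2S} - \theta^V_{S,2S}}\left[\sqrt{\frac{\log p_n}{\log n}}\,\sigma_1\sqrt{2S} + \left(\frac{n^{1/(2m)}}{p_n}\right)^{m}\frac{p_n^m\omega_{p_n}M\sqrt{S}}{\sqrt{\log n}}\right] \\
& + \left(\frac{n^{1/(2m)}}{p_n}\right)^{m}\frac{p_n^m\omega_{p_n}}{n^{b_t/(2m)}\sqrt{\log n}}.
\end{aligned} \tag{21}$$

But if we let $p_n = O(n^{1/(2m)})$ then (21) is bounded because
$p_n/n^{1/(2m)}$ and $\log p_n/\log n$ are bounded by construction of $p_n$,
$p_n^m\omega_{p_n}$ is bounded by A-2, $p_n^{-b_t}\alpha_{n,p_n}(t)$ is bounded by
A-3 and $(1 - \delta^V_{2S} - \theta^V_{S,2S})^{-1}$ is bounded by A-6.
Hence $|\hat\beta_n(t) - \beta(t)| = O\big(\sqrt{\log n}/n^{1/2 - b_t/(2m)}\big)$.
With the addition of A-4 exactly the same argument can be used to prove
$\sup_t|\hat\beta_n(t) - \beta(t)| = O\big(\sqrt{\log n}/n^{1/2 - c/(2m)}\big)$. □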

Proof of Theorem 5. By Theorem 2, if we assume $\varepsilon^* \sim N(0,\sigma_2^2 I)$,
then, for $p = p^*$,

$$|\hat\beta(t) - \beta(t)| \le \frac{1}{\sqrt{n}}C_{n,p^*}(t)\sigma_2\sqrt{2S_{p^*}(1+\phi)\log p^*} + \omega_{p^*}$$

with arbitrarily high probability provided $\phi$ is large enough. Then we can
show that the first term is $O(n^{-1/2})$ in exactly the same fashion as for the
proof of Theorem 3. Also, by A-2, the second term is bounded by $H/p^{*m}$. □



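Proof of Theorem 6. If $\varepsilon^* \sim N(0,\sigma_2^2 I)$ then, by (14),
setting $\phi = 0$, with probability converging to one as $p \to \infty$,

$$|\hat\beta_n(t) - \beta(t)| \le \frac{1}{\sqrt{n}}C_{n,p_n}(t)\sigma_2\sqrt{2S\log p_n} + \omega_{p_n} = \frac{\sqrt{\log n}}{n^{(1/2)(m/(m+b_t))}}\,K,$$

where

$$K = \left(\frac{p_n}{n^{1/(2(m+b_t))}}\right)^{b_t}\sqrt{\frac{\log p_n}{\log n}}\,\frac{4p_n^{-b_t}\alpha_{n,p_n}(t)}{1 - \delta^V_{2S} - \theta^V_{S,2S}}\,\sigma_2\sqrt{2S} + \left(\frac{n^{1/(2(m+b_t))}}{p_n}\right)^{m}\frac{p_n^m\omega_{p_n}}{\sqrt{\log n}}. \tag{22}$$

Hence if $p_n = O(n^{1/(2(m+b_t))})$ then (22) is bounded, using the same
arguments as with (21), so
$|\hat\beta_n(t) - \beta(t)| = O\big(\sqrt{\log n}/n^{(1/2)(m/(m+b_t))}\big)$.
We can prove that
$\sup_t|\hat\beta_n(t) - \beta(t)| = O\big(\sqrt{\log n}/n^{(1/2)(m/(m+c))}\big)$
in the same way. □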



APPENDIX E: PROOF OF COROLLARY 1 

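Here we assume that $A$ is produced using the standard second difference
matrix,

$$A = p^2\begin{bmatrix}
1/p^2 & 0 & 0 & \cdots & 0 & 0 \\
-1/p & 1/p & 0 & \cdots & 0 & 0 \\
1 & -2 & 1 & \cdots & 0 & 0 \\
\vdots & \ddots & \ddots & \ddots & & \vdots \\
0 & \cdots & 0 & 1 & -2 & 1
\end{bmatrix}. \tag{23}$$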
Throughout this proof let $\eta_k = \beta(k/p)$. First we show A-1 holds. Suppose
that $\beta''(t) = 0$ for all $t$ in $R_{k-2}$, $R_{k-1}$ and $R_k$; then there
exist $b_0$ and $b_1$ such that $\beta(t) = b_0 + b_1 t$ over this region. Hence,
for $k > 2$,

$$\begin{aligned}
\gamma_k &= p^2(\eta_{k-2} - 2\eta_{k-1} + \eta_k) \\
&= p^2\big(\beta((k-2)/p) - 2\beta((k-1)/p) + \beta(k/p)\big) \\
&= p^2\big(b_0 + b_1(k-2)/p - 2b_0 - 2b_1(k-1)/p + b_0 + b_1 k/p\big) = 0.
\end{aligned}$$

But note that if $\beta''(t) \ne 0$ at no more than $S$ values of $t$ then there
are at most $3S$ triples such that $\beta''(t) \ne 0$ for some $t$ in $R_{k-2}$,
$R_{k-1}$ and $R_k$. Hence there can be no more than $3S + 2$ $\gamma_k$'s that
are not equal to zero (where the two comes from $\gamma_1$ and $\gamma_2$).
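The sparsity argument for A-1 can be checked numerically. The sketch below builds the second difference matrix with first row $\eta_1$, second row $p(\eta_2 - \eta_1)$ and remaining rows $p^2(\eta_{k-2} - 2\eta_{k-1} + \eta_k)$, applies it to a piecewise linear $\beta(t)$ with $S = 2$ kinks, and counts the nonzero $\gamma_k$. The example curve, the tolerance and the function name are assumptions chosen for illustration.

```python
import numpy as np

def second_difference_matrix(p):
    """Matrix A with gamma = A eta as in the second difference construction."""
    A = np.zeros((p, p))
    A[0, 0] = 1.0                                   # gamma_1 = eta_1
    A[1, 0], A[1, 1] = -float(p), float(p)          # gamma_2 = p (eta_2 - eta_1)
    for k in range(2, p):                           # gamma_k = p^2 * second difference
        A[k, k - 2:k + 1] = p ** 2 * np.array([1.0, -2.0, 1.0])
    return A

p, S = 100, 2
t = np.arange(1, p + 1) / p                         # eta_k = beta(k / p)
# Piecewise linear beta with kinks at t = 0.305 and t = 0.695 (illustrative choice)
eta = 0.2 + np.maximum(t - 0.305, 0.0) - np.maximum(t - 0.695, 0.0)

gamma = second_difference_matrix(p) @ eta
n_nonzero = int(np.sum(np.abs(gamma) > 1e-6))       # tolerance absorbs rounding error
print(n_nonzero, "nonzero gammas; bound 3S + 2 =", 3 * S + 2)
```

Only the entries whose triple of grid points straddles a kink, plus the leading entry, are nonzero, so the count stays within the $3S + 2$ bound regardless of how large $p$ is.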

Next we show A-2 holds. For any $t \in R_k$, $B(t)^T\eta = \eta_k$. But since
$|\beta'(t)| < G$ for some $G < \infty$ and $R_k$ is of length $1/p$ it must be
the case that $\sup_{t\in R_k}\beta(t) - \inf_{t\in R_k}\beta(t) \le G/p$. Let
$\eta_k$ be any value between $\sup_{t\in R_k}\beta(t)$ and
$\inf_{t\in R_k}\beta(t)$ for $k = 1, \ldots, p$. Then

$$\omega_p = \sup_t|\beta(t) - B(t)^T\eta| \le \max_k\Big\{\sup_{t\in R_k}\beta(t) - \inf_{t\in R_k}\beta(t)\Big\} \le G/p$$

so A-2 holds with $m = 1$.

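Now, we show A-3 holds. For $t \in R_k$ let

$$L_{nj}(t) = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{1}{p}\sum_{l=1}^{p}\bar X_{il}\frac{A^{-1}_{lj}}{A^{-1}_{kj}}\right)^2,$$

where $A^{-1}_{lk}$ is the $l,k$th element of $A^{-1}$ and $\bar X_{il}$ is the
average of $X_i(t)$ in $R_l$, that is, $p\int_{R_l}X_i(s)\,ds$. Then
$\alpha_{n,p}(t) = \sqrt{\sum_{j=1}^{p}L_{nj}(t)^{-1}}$. Since $X_i(t)$ is
bounded above zero,
$L_{nj}(t) \ge W_\varepsilon^2\big(\frac{1}{p}\sum_{l=1}^{p}A^{-1}_{lj}/A^{-1}_{kj}\big)^2$
for some $W_\varepsilon > 0$. It is easily verified that

$$A^{-1}_{l1} = 1, \qquad A^{-1}_{l2} = \frac{l-1}{p}, \qquad A^{-1}_{lj} = \frac{l-j+1}{p^2} \ \text{ for } l \ge j \ge 3, \qquad A^{-1}_{lj} = 0 \ \text{ for } l < j,$$

and, hence,

$$\frac{A^{-1}_{lj}}{A^{-1}_{kj}} = \frac{l-j+1}{k-j+1}, \qquad l \ge j, \ k \ge j,$$

except for $j = 1$ in which case the ratio equals one for all $l$ and $k$. For
$t = 0$ then $t \in R_1$ (i.e., $k = 1$) and, hence, $L_{n1}(0) \ge W_\varepsilon^2$
while $L_{nj}(0) = \infty$ for all $j > 1$. Therefore,
$\alpha_{n,p}(0) \le W_\varepsilon^{-1}$ for all $p$. Hence, A-3 holds with
$b_0 = 0$.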

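Alternatively, for $0 < t < 1$, then $k = \lfloor pt\rfloor$ and
$L_{nj}(t) = \infty$ for $j > k$. Hence for $j \le k$,

$$L_{nj}(t) \ge W_\varepsilon^2\left(\frac{1}{p}\sum_{l=j}^{p}\frac{l-j+1}{k-j+1}\right)^2 = W_\varepsilon^2\,\frac{1}{p^2}\left(\frac{(p-j+1)(p-j+2)}{2(k-j+1)}\right)^2.$$

Therefore

$$\alpha_{n,p}(t) \le p^{1/2}\,\frac{2}{W_\varepsilon}\,\frac{t^{3/2}}{(1-t)^2} \tag{25}$$

since $k = \lfloor pt\rfloor$. Hence A-3 holds with $b_t = 1/2$ for $0 < t < 1$.

Finally, note that (25) holds for any $t < 1$ and is increasing in $t$ so

$$\sup_{0 < t < 1-a}\alpha_{n,p}(t) \le p^{1/2}\,\frac{2}{a^2 W_\varepsilon}$$

for any $a > 0$ and hence A-4 holds with $c = 1/2$.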